witold–szymanski/data–analysis–in–core–facility–environment

Advanced Module (Vertiefungsmodul) Proteomics

Motivation

In the proteomics core facility we identify, quantify, and characterize proteins, peptides, and post-translational modifications (PTMs).

  • sample origin

    • cell lysates
    • supernatant
    • extracellular vesicles
    • IP…
  • approach

    • label-free with DIA for standard samples

    • TMT-labeling for phosphoproteomics and other modifications

Proteomic experiment

Sample complexity

Chromatogram of one of the IP samples that were the subject of your previous exercise.

Chromatogram of a typical complex human sample.

Spectral counting vs. intensity calculation

Search engine

We commonly use MaxQuant or DIA-NN for spectra identification.

Tons of alternatives:

  • FragPipe
  • ProteomeDiscoverer
  • PEAKS
  • Mascot
  • Skyline
  • Spectronaut
  • Progenesis

Big data analysis

  • scaffold file (4 samples)

  • big data report (20 samples)

  • takes 5 minutes to open in your favorite MS Excel-like software

  • table contains 606934 rows
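
A table of this size is easier to handle in code than in a spreadsheet. The following is a minimal sketch, assuming the report is a tab-separated text file and using `fread()` from the data.table package. The file and its column names (`Protein.Group`, `Run`, `Intensity`) are hypothetical stand-ins, written to a temporary file so that the example is self-contained:

```r
library(data.table)

# Hypothetical stand-in for a large search-engine report (long format):
# one row per protein group and run. Column names are assumptions.
set.seed(1)
report <- data.table(
  Protein.Group = rep(paste0("P", 1:1000), times = 20),
  Run           = rep(paste0("sample_", 1:20), each = 1000),
  Intensity     = 2^rnorm(20000, mean = 20, sd = 2)
)
report_file <- tempfile(fileext = ".tsv")
fwrite(report, report_file, sep = "\t")

# fread() detects the separator and column types and reads large files quickly
dt <- fread(report_file)

# Quick overview without opening the file in a spreadsheet
n_rows  <- nrow(dt)
per_run <- dt[, .(n_groups = uniqueN(Protein.Group)), by = Run]
print(n_rows)
print(head(per_run))
```

The same two lines (`fread()` plus a grouped summary) should work on a real report once the file path and column names are adapted.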

Use programming

Our go-to is R, a language and environment for statistical computing and graphics:

  • provides a wide variety of statistical techniques

    • linear and nonlinear modelling
    • classical statistical tests
    • time-series analysis
    • classification
    • clustering

  • beautiful graphical output

  • highly extensible

Use programming

This presentation is also written in code: it uses Quarto with reveal.js, an open-source HTML presentation framework.

Data analysis - Example

Experimental design:

  • 20 samples
  • 4 conditions
  • 5 replicates per condition
  • EV (extracellular vesicles) enrichment

Step1 - Filtering

Similar to the Scaffold workflow:

  • protein and peptide matching FDR already filtered to 1% at the search engine level
  • additional filtering of the data

Filter                                                 n
(none)                                              5350
Reverse == ""                                       5350
`Potential contaminant` == ""                       5251
expr > 0, for at least two samples in some subgroup 5210

The spectra identification software identified 5350 protein groups. After dropping 99 potential contaminants and 41 protein groups without replication (within a subgroup), 5210 protein groups were retained for further analysis.
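
The filter cascade can be sketched with data.table. This is an illustrative re-implementation on toy data, not the code that produced the numbers above. The column names `Reverse` and `Potential contaminant` follow the filter table, while the sample columns, the subgroup layout, and the helper `replicated_in_some_subgroup()` are hypothetical. In the toy data, `NA` marks a protein group that was not quantified in a sample:

```r
library(data.table)

# Toy protein-group table: 2 subgroups (A, B) x 3 replicates; NA = not quantified
pg <- data.table(
  id                      = paste0("PG", 1:6),
  Reverse                 = c("", "", "+", "", "", ""),
  `Potential contaminant` = c("", "+", "", "", "", ""),
  A_1 = c(10, 11, 9, NA, 12, NA),
  A_2 = c(11, 12, 9, NA, 12, NA),
  A_3 = c(10, NA, 8, 13, NA, NA),
  B_1 = c( 9, 10, 9, NA, NA, 14),
  B_2 = c(10, 11, 9, NA, 11, NA),
  B_3 = c( 9, 10, 8, NA, NA, NA)
)
subgroups <- list(A = c("A_1", "A_2", "A_3"), B = c("B_1", "B_2", "B_3"))

# TRUE if a protein group is quantified in >= min_n samples of at least one subgroup
replicated_in_some_subgroup <- function(dt, subgroups, min_n = 2) {
  counts <- do.call(cbind, lapply(subgroups, function(cols) {
    rowSums(!is.na(dt[, cols, with = FALSE]))
  }))
  rowSums(counts >= min_n) > 0
}

# Apply the filters one by one and record how many protein groups remain
n_steps <- c(all = nrow(pg))
pg <- pg[Reverse == ""]
n_steps["reverse"] <- nrow(pg)
pg <- pg[`Potential contaminant` == ""]
n_steps["contaminant"] <- nrow(pg)
pg <- pg[replicated_in_some_subgroup(pg, subgroups)]
n_steps["replicated"] <- nrow(pg)
print(n_steps)
```

Recording the count after every step gives the same kind of audit trail as the filter table, which makes it easy to see which filter removes the most protein groups.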

Step2 - Quality control

Step2 - QC - PCA

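library(autonomics)     # pca(), biplot()
library(RColorBrewer)   # brewer.pal()
library(ggplot2)        # ggtitle()

pal_paired <- brewer.pal(12, "Set3")   # 12-color palette

# PCA biplot of the first assay, samples labelled by replicate
# (assayNames() is an assumption: the source shows only `(object)[1]`)
biplot(pca(object,
           assay = SummarizedExperiment::assayNames(object)[1]),
       shape = NULL,
       label = "replicate",
       palette = pal_paired) +
  ggtitle("log2.MaxLFQ")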
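
What the PCA step computes can be illustrated with base R. The following is a minimal sketch on simulated log2 intensities, assuming proteins in rows, samples in columns, and no missing values. The matrix `mat`, the group labels, and the simulated shift are hypothetical, and `prcomp()` is used here as a stand-in for the `pca()` call:

```r
set.seed(42)

# Simulated log2 intensity matrix: 200 proteins (rows) x 6 samples (columns)
groups <- rep(c("Basal", "Primed"), each = 3)
mat <- matrix(rnorm(200 * 6, mean = 20, sd = 1), nrow = 200,
              dimnames = list(paste0("P", 1:200),
                              paste0(groups, "_", rep(1:3, times = 2))))
# Shift the first 50 proteins in the second group to create a condition effect
mat[1:50, groups == "Primed"] <- mat[1:50, groups == "Primed"] + 3

# prcomp() expects observations (samples) in rows, hence the transpose;
# centering each protein removes abundance differences between proteins
pca_fit <- prcomp(t(mat), center = TRUE, scale. = FALSE)

# Percentage of variance explained by each principal component
var_explained <- 100 * pca_fit$sdev^2 / sum(pca_fit$sdev^2)
print(round(var_explained, 1))

# Sample scores on the first two components, colored by group
plot(pca_fit$x[, 1:2], col = as.integer(factor(groups)), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", var_explained[2]))
```

In a QC context the interesting question is whether replicates of the same condition cluster together on the first components; a sample far from its own group is a candidate outlier.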

Step2 - QC - profile plots

library(ggplot2)

# Line plot of the selected proteins (t_sel) across samples;
# theme_lineplot is a custom theme defined elsewhere
ggplot(
  data = PG_log_org,
  aes(x = Sample,
      y = PG.MaxLFQ,
      group = gene)) +
  theme_lineplot +
  labs(
    title = "Protein line plots",
    subtitle = "MaxLFQ",
    y = "Intensity (log2)") +
  geom_point(
    data = t_sel,
    size = 2,
    aes(color = gene),
    alpha = 0.9) +
  geom_line(
    data = t_sel,
    size = 1,
    aes(color = gene),
    alpha = 0.7)

Step2 - QC - profile plots

library(ggplot2)

# Same line plot as before, drawn with thicker lines (size = 3)
ggplot(
  data = PG_log_org,
  aes(x = Sample,
      y = PG.MaxLFQ,
      group = gene)) +
  theme_lineplot +
  labs(
    title = "Protein line plots",
    subtitle = "MaxLFQ",
    y = "Intensity (log2)") +
  geom_point(
    data = t_sel,
    size = 2,
    aes(color = gene),
    alpha = 0.9) +
  geom_line(
    data = t_sel,
    size = 3,
    aes(color = gene),
    alpha = 0.7)

Step2 - QC - Correlation matrix

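library(corrplot)

# Select the intensity columns (column 11 to the last column)
t_int <- df_wide[, 11:ncol(df_wide)]

# Pearson correlation between samples, using only proteins without NAs
res <- cor(
  t_int,
  use = "complete.obs",
  method = "pearson")

# Correlation heatmap, samples ordered by hierarchical clustering
corrplot(
  res,
  type = "full",
  order = "hclust",
  method = "color",
  tl.col = "black",
  tl.srt = 45)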

Step2 - QC - Number of identified proteins

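library(ggplot2)

# Bar chart of the number of identified protein groups per sample
ggplot(
  data = PG_log_org,
  aes(Sample)) +
  labs(title = "Number of identified protein groups") +
  geom_bar(aes(fill = subgroup)) +
  # leave 30% headroom so the count labels fit
  ylim(0, max(table(PG_log_org$Sample)) * 1.3) +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),  # replaces the deprecated ..count..
    hjust = -0.2) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90),
    legend.position = "none") +
  coord_flip()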

Step2 - QC - Number of identified proteins

plot_sample_violins(object)+ 
  theme(legend.position="none")

Step3 - Statistics

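Differential expression analysis tests whether the differences between conditions are statistically significant.

The p-value is the probability of observing a difference between groups at least as large as the measured one if it arose from random sampling alone, that is, if there were no true difference.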

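library(autonomics)
library(magrittr)   # %<>% assignment pipe
library(ggplot2)    # ggtitle()

# Contrasts: each treatment condition versus Basal
contrastdefs <- c('Primed - Basal',
                  'Physioxia - Basal',
                  'AcHypx - Basal')

# Design matrix without intercept: one coefficient per subgroup
design  <- create_design(
  object, 
  formula = ~ 0 + subgroup, 
  drop = TRUE)
# Fit the limma model and test the contrasts
object %<>% fit_limma(
  design = design, 
  contrasts = contrastdefs)

# Volcano plots, one panel per contrast
plot_volcano(
  object, 
  coefs = contrastdefs,  
  max.overlaps = 10, 
  label = 'cleanprotein', 
  features = df_selection, 
  nrow = 1, 
  size = "log2.MaxLFQ", 
  xlabel = -5) + 
  ggtitle("Volcano plot")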
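
To make the meaning of a p-value concrete, the following sketch runs an ordinary two-sample t-test per protein on simulated data and then adjusts for multiple testing. It is a simplified stand-in, not the moderated limma model fitted above; the data, group sizes, and effect size are hypothetical:

```r
set.seed(7)

# Simulated log2 intensities: 1000 proteins, 5 replicates per condition
n_prot <- 1000
basal  <- matrix(rnorm(n_prot * 5, mean = 20), nrow = n_prot)
primed <- matrix(rnorm(n_prot * 5, mean = 20), nrow = n_prot)
# The first 100 proteins get a true log2 fold change of 3
primed[1:100, ] <- primed[1:100, ] + 3

# Welch t-test per protein: the p-value is the probability of a difference
# at least this extreme if there were no true difference between conditions
pvals <- vapply(seq_len(n_prot),
                function(i) t.test(primed[i, ], basal[i, ])$p.value,
                numeric(1))
log2fc <- rowMeans(primed) - rowMeans(basal)

# Testing many proteins inflates the number of false positives;
# the Benjamini-Hochberg procedure controls the false discovery rate
padj <- p.adjust(pvals, method = "BH")

res <- data.frame(protein = paste0("P", seq_len(n_prot)), log2fc, pvals, padj)
n_sig <- sum(res$padj < 0.05)
print(n_sig)
print(head(res[order(res$pvals), ]))
```

Plotting `log2fc` against `-log10(pvals)` from this table gives the basic layout of a volcano plot.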

Step3 - Expression profiles

object$subgroup %<>% factor(unique(.))
plot_exprs(
  object_exprs[df_selection, ], 
  block = 'replicate', n=Inf,
  facet='feature_id')+
  ggtitle(NULL)+
  theme(legend.position = "none")

Step4 - Missing values

Importance of replication

Step4 - Missing values

There are two types of NAs:

  • systematic NAs: missing completely in some subgroups but detected in others (in at least half of the samples). These represent potential switch-like responses.

  • random NAs: missing in some samples, with “missingness” unrelated to subgroup. These do not require imputation for the statistical analysis to return p-values.

In this dataset:

  • 422 protein groups have systematic NAs

  • 2024 protein groups have random NAs

  • 2764 protein groups have no NAs
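
The distinction can be made explicit in code. The following is a minimal sketch that classifies each protein group by its NA pattern, assuming a log2 intensity matrix with proteins in rows and one subgroup label per sample. The function `classify_nas()` and the toy data are hypothetical, and the rule used (completely missing in one subgroup, detected in at least half of the samples of another) is one reading of the definition above, not the exact rule of any package:

```r
classify_nas <- function(mat, subgroup) {
  # Fraction of detected (non-NA) values per protein and subgroup
  groups <- unique(subgroup)
  frac_detected <- do.call(cbind, lapply(groups, function(g) {
    rowMeans(!is.na(mat[, subgroup == g, drop = FALSE]))
  }))
  has_na <- rowSums(is.na(mat)) > 0
  # Systematic: absent from at least one subgroup, detected in >= half of another
  systematic <- rowSums(frac_detected == 0) > 0 &
                rowSums(frac_detected >= 0.5) > 0
  ifelse(!has_na, "none", ifelse(systematic, "systematic", "random"))
}

# Toy data: 2 subgroups x 3 replicates
subgroup <- rep(c("Basal", "Primed"), each = 3)
mat <- rbind(
  complete  = c(20, 21, 20, 22, 21, 22),
  switch_on = c(NA, NA, NA, 19, 20, 19),  # absent in Basal, present in Primed
  sporadic  = c(20, NA, 21, 20, 21, NA)   # NAs scattered across both subgroups
)
na_type <- classify_nas(mat, subgroup)
print(table(na_type))
```

Tabulating `na_type` on a real dataset yields counts of the kind listed above and shows how many protein groups depend on the imputation strategy.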

Step5 - Statistics after imputation

Step6 - GO term

GO stands for Gene Ontology and as the name suggests, it annotates genes using an ontology. It is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.

Conclusions

Learn a coding language :)


Thanks for your attention!

Conclusions

[<start> idea]-:> [<phospho>experimental design]
[<phospho>
    experimental design | sample origin |  
    quantification strategy |correct  controls|number of replicates] 
    -:> [<condb>sample preparation]
[<condb>
    sample preparation | lysis | digestion | fractionation  | purification | enrichment ] 
    -:> [<boxa>chromatography]
[<boxa>
    chromatography | gradient length | column | elution profile]  
-:> [<boxb>mass spectrometry]


[<table> mass spectrometry] 
  -:> [<conda>spectra identification]

[<conda>spectra identification] 
  -:> [<boxd>data analysis ]

[<start> idea] 
  <-- [<boxd>data analysis]

[<table> mass spectrometry] 
  -:> [<boxc>
         instrument settings | resolution | mass range | mass accuracy | speed | ionization energy | fragmentation method]
[<table> mass spectrometry] 
  -:> [<table>acquisition mode | resolution | mass range | mass accuracy | speed | ionization energy | fragmentation method]
[<boxc>acquisition mode] 
  <--> [<boxc>instrument settings]

Where to find us

Institute of Translational Proteomics

Core Facility Translational Proteomics

Address

Philipps–University Marburg
Department of Medicine
Biochemical/Pharmacological Centre, Building K|03
Karl–von–Frisch–Straße 2
35043 Marburg
GERMANY

Email Address

translational.proteomics@…

Geolocation

50.8044, 8.80745

Site Plan (With Transportation Options)

Questions

  • What is needed to allow a statistical test? (replication)
  • What are the different ways to quantify proteins analyzed by mass spectrometry? (spectral counts, area of the elution peak)
  • What are the two types of missing values? (systematic and random)